Covid-19 by Political Party

2021-08-16

Covid-19 by State and political party

The purpose of this project is to imitate a research article, obtain data from websites using a web scrapper in R, and to create visualizations surrounding based on covid-19 infections and the affiliated party of the state governor. Within this project: basic web scraping techniques were used to pull data from different sources such as wikipedia; in addition, the datasets were joined together using dplyr’s inner_join. After processing the data, simple visualization methods were used for a superficial exploration of the data, and a linear regression was used to compare the means of the percent infected by the governing official’s political party by state.

##Loading in Datasets

url <- "https://simple.wikipedia.org/wiki/List_of_United_States_governors"
political_party <- read_html(url) %>%
  html_node("table") %>%
  html_table(header = T) %>%
  row_to_names(1) %>%
  select(!c(2,4,10,11))
## Warning in row_to_names(., 1): Row 1 does not provide unique names. Consider
## running clean_names() after row_to_names().
abbr <- tibble(data = abbr) %>%
  separate(data, into = c("Abbr", "State"), sep = " – ")

covid <- fread("Data/United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv") %>%
  filter(state %in% abbr$Abbr) %>%
  mutate(submission_date = as.Date(submission_date, format = "%m/%d/%Y")) %>%
  group_by(state) %>%
  filter(submission_date == max(submission_date))

census <- read.csv("Data/csvData.csv") %>%
  select(!`ï..rank`)

Visualizations

covid_PP <- covid %>%
  inner_join(abbr, by = c("state" = "Abbr")) %>%
  inner_join(political_party, by = "State") %>%
  inner_join(census, by = "State") %>%
  mutate(perc_cases = tot_cases/Pop2018 * 100,
         perc_death_by_cases = tot_death/tot_cases * 100,
         perc_death_by_pop = tot_death/Pop2018 * 100,
         Party = str_extract(Party, "Republican|Democratic")) %>%
  select(State, Party, perc_death_by_pop, perc_death_by_cases, perc_cases, tot_cases, tot_death)
## Adding missing grouping variables: `state`
ggplot(covid_PP, aes(x = Party, y = perc_cases, fill = Party)) +
  geom_boxplot(fill = c("Blue", "Red")) +
  ylab("Percent Cases") +
  ggtitle("Boxplot of percent cases by political party")

Based on the above box plot, it can be noted that there is a difference in the median percent cases of republican and democratic states; however, given the chart itself, there is not clear evidence whether there is a statistical difference between the medians of percent cases for the total population given the affiliated political party of the state leader. It is entirely possible that this difference could be due to a variation assuming the data is representative of the states situation.

plot_geo(covid_PP, locationmode = 'USA-states') %>%
  add_trace(
    z = ~perc_cases, locations = ~state,
    color = ~perc_cases, colors = "Blues"
  ) %>%
  layout(
    title = "Percent Infected by State",
    geo = 
      list(
      scope = 'usa',
      projection = list(type = 'albers usa')
      )
  ) %>%
  colorbar(title = "Percent Infected")

Above is a state map of the United States and the respective percent of cases.

ggplotly(ggplot(covid_PP, aes(fct_reorder(State, perc_cases), perc_cases, color = Party)) +
  geom_point() +
  geom_hline(yintercept = mean(covid_PP$perc_cases), linetype = "dashed") +
  scale_colour_manual(values = c("Blue", "Red")) +
  labs(x = "State", y = "Percent of Population Infected", title = "Scatter plot and mean of Percent cases by state and political party") +
  
  coord_flip() +
  theme_minimal())

Based on the plot above, it is am important question to ask why the state Rhode Island appears to be an outlier when compared to all of the democratic states. Another state which appears to be an outlier is Hawaii but there are some simple explanations as to why it was able to effectively reduce the transmission of covid cases and that is because it is not apart of the 48 contiguous states and that it enacted strict lockdown measures at the beginning of covid-19. Alaska can the US territory of Puerto Rico can directly be compared against hawaii as they also are not contiguous states/territories.

Statistical Test

Assuming that the data is representative of the true situation of each state, based on the exploratory data analysis it would be important to test if there were a difference between the mean percent of cases between republican and democratic led states or if there is no difference at all. The hypothesis test would be as state: \[H_0: \mu_R - \mu_D = 0\] \[H_a: \mu_R - \mu_D \ne 0\]

var.test(perc_cases ~ Party, data = covid_PP)
## 
##  F test to compare two variances
## 
## data:  perc_cases by Party
## F = 1.4637, num df = 22, denom df = 26, p-value = 0.3504
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6522116 3.3795229
## sample estimates:
## ratio of variances 
##            1.46374
lm(covid_PP$perc_cases ~ covid_PP$Party, alternative = "two.sided", paired = F, var.equal = FALSE) %>%
  summary()
## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra arguments 'alternative', 'paired', 'var.equal' will be disregarded
## 
## Call:
## lm(formula = covid_PP$perc_cases ~ covid_PP$Party, alternative = "two.sided", 
##     paired = F, var.equal = FALSE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4897 -0.6306  0.4765  1.6588  4.9603 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                9.6236     0.5211  18.467   <2e-16 ***
## covid_PP$PartyRepublican   1.5821     0.7091   2.231   0.0304 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.499 on 48 degrees of freedom
## Multiple R-squared:  0.09396,    Adjusted R-squared:  0.07508 
## F-statistic: 4.978 on 1 and 48 DF,  p-value: 0.03039
t.test(perc_cases ~ Party, data = covid_PP, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  perc_cases by Party
## t = -2.231, df = 48, p-value = 0.03039
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.0079840 -0.1563017
## sample estimates:
## mean in group Democratic mean in group Republican 
##                 9.623559                11.205701

The statistical test above shows that there is a difference (assuming the data is representative) with a p-value of 0.0304. Although there is a statistical difference between the means, it is important to state that this data is observational and does not mean causation.